Abstract. We show how, given a straight-line program with g rules for a binary string B of length n,
in $O(g^{2/3} n^{4/3})$ time we can build a $(2nH_0(B) + o(n))$-bit index such that, given m and c, in $O(1)$ time
we can determine whether there is a substring of B with length m containing exactly c copies of 1. If
we use $O(n \log n)$ bits for the index, then we can list all such substrings using $O(m)$ time per substring.

1 Introduction 

Motivated by problems in mass spectrometry, several researchers have studied the problem of 
jumbled pattern matching, i.e., finding substrings of a given text that consist of a given multiset 
of characters. Building indexes for jumbled pattern matching has turned out to be a challenging 
problem: currently, we can build reasonably-sized indexes only for binary strings and, even then, 
the construction time is superlinear and the indexes report only whether there exists a matching 
substring. 

Cicalese, Fici and Liptak [5] showed how, given a binary string B[1..n], in $O(n^2)$ time we can
build a linear-space index such that, given m and c, in $O(1)$ time we can determine whether there
is a substring of B with length m containing exactly c copies of 1. Their key observation was that,
if one substring of length m contains fewer than c copies of 1 and another contains more, then a
third such substring must contain exactly c copies. Their index is a table T saying, for $1 \le m \le n$,
how many and how few 1s there can be in a substring of length m.

Burcsi, Cicalese, Fici and Liptak [2] (see also [3, 4]) and Moosa and Rahman [10] independently
reduced Cicalese, Fici and Liptak's construction time to $O(n^2 / \log n)$, then Moosa and Rahman [11]
reduced it to $O(n^2 / \log^2 n)$ in the RAM model. Cicalese, Laber, Weimann and Yuster [6] showed
how to build in $O(n^{1+\epsilon})$ time an approximate index, which may return false positives when the
query is close (in a sense depending on $\epsilon$) to one for which there is a matching substring. Badkobeh,
Fici, Kroon and Liptak [1] showed how to build in $O(r^2 \log r)$ time a potentially smaller index that
answers queries in $O(\log n)$ time, where r is the number of runs in B (i.e., maximal substrings
containing only 0s or only 1s). Recently, Giaquinta and Grabowski [9] gave several time-space
tradeoffs based on Badkobeh et al.'s construction.

Suppose that, as we receive B as input, we store it as a compressed bitvector that takes
$O(nH_0(B)) + o(n)$ bits, where $H_0(B) \le 1$ is the 0th-order empirical entropy of B. With this
bitvector, we can answer rank queries on B in $O(1)$ time; given a position, a rank query returns the
number of 1s in the prefix ending at that position. We can loop through all $O(n^2)$ substrings of B
in increasing order of length and count the number of 1s in each substring using two rank queries.
This way, in $O(n^2)$ time we can compute T row by row using $O(nH_0(B)) + o(n)$ bits of workspace.
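The row-by-row computation above can be sketched as follows. This is a minimal illustration only: a plain prefix-sum array stands in for the compressed bitvector with $O(1)$-time rank, and the names build_table and has_match are ours, not from the cited work.

```python
from itertools import accumulate

def build_table(bits):
    """Return (lo, hi), where lo[m] and hi[m] are the minimum and maximum
    number of 1s over all substrings of length m (index 0 is padding)."""
    n = len(bits)
    # prefix[p] = number of 1s in bits[:p]; stands in for a rank query.
    prefix = [0] + list(accumulate(bits))
    lo = [0] * (n + 1)
    hi = [0] * (n + 1)
    for m in range(1, n + 1):  # one row of T per length m
        # Each count uses two "rank queries" (two prefix-sum lookups).
        counts = [prefix[s + m] - prefix[s] for s in range(n - m + 1)]
        lo[m], hi[m] = min(counts), max(counts)
    return lo, hi

def has_match(lo, hi, m, c):
    """True iff some substring of length m contains exactly c copies of 1."""
    return 1 <= m < len(lo) and lo[m] <= c <= hi[m]
```

For B = 0110100, for example, the substrings of length 4 contain between 1 and 3 copies of 1, so a query with m = 4 and c = 2 succeeds.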

Fici and Liptak [7] (see also [1]) observed that, if we increase m by 1, then the minimum number
of 1s can only stay the same or increment, as can the maximum number of 1s. Therefore, we can
encode T as two n-bit binary strings $B_{\min}$ and $B_{\max}$ in which 1s indicate increments. Given m
and c, in $O(1)$ time we can determine whether there is a substring of B with length m containing
exactly c copies of 1, by checking whether $B_{\min}.\mathrm{rank}(m) \le c \le B_{\max}.\mathrm{rank}(m)$. We now note that
$B_{\min}.\mathrm{rank}(n) = B_{\max}.\mathrm{rank}(n) = B.\mathrm{rank}(n)$, so $H_0(B_{\min}) = H_0(B_{\max}) = H_0(B)$. Therefore, if we
store $B_{\min}$ and $B_{\max}$ as compressed bitvectors, then they take a total of $2nH_0(B) + o(n)$ bits and
support rank queries in $O(1)$ time. We can build these compressed bitvectors incrementally as we
compute T row by row, so we still use $O(nH_0(B)) + o(n)$ bits of workspace overall.
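As an illustration of this encoding, the sketch below stores the two increment strings and answers a query with two rank operations. It assumes $1 \le m \le n$. The class RankBitvector is a hypothetical, uncompressed stand-in built on prefix sums; it does not achieve the entropy-compressed space bound.

```python
from itertools import accumulate

def min_max_rows(bits):
    """lo[m], hi[m] = min and max number of 1s in a substring of length m."""
    n = len(bits)
    pre = [0] + list(accumulate(bits))
    lo, hi = [0], [0]
    for m in range(1, n + 1):
        counts = [pre[s + m] - pre[s] for s in range(n - m + 1)]
        lo.append(min(counts))
        hi.append(max(counts))
    return lo, hi

class RankBitvector:
    """Uncompressed bitvector with prefix sums (illustrative stand-in)."""
    def __init__(self, bits):
        self.bits = list(bits)
        self._pre = [0] + list(accumulate(self.bits))

    def rank(self, m):
        """Number of 1s among the first m bits."""
        return self._pre[m]

def build_index(bits):
    lo, hi = min_max_rows(bits)
    # A 1 in position m marks that the row value grew when going to length m.
    b_min = RankBitvector(lo[m] - lo[m - 1] for m in range(1, len(lo)))
    b_max = RankBitvector(hi[m] - hi[m - 1] for m in range(1, len(hi)))
    return b_min, b_max

def has_match(b_min, b_max, m, c):
    return b_min.rank(m) <= c <= b_max.rank(m)
```

For B = 0110100 the two strings are 0010110 and 1101000; both contain three 1s, as does B itself.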

In this paper we show how, given a straight-line program (SLP) for B with g rules (i.e., a
context-free grammar with g rules in Chomsky normal form that generates B and only B), we can
build Cicalese, Fici and Liptak's table T in $O(g^{2/3} n^{4/3})$ time. Informally, a string has a small SLP
if LZ77 compression works well on it, and vice versa [12]; by the analysis of LZ78, we can assume
$g = O(n / \log n)$ so our time bound is $O(n^2 / \log^{2/3} n)$ even when B is not compressible. We also
show how, by indexing substrings and using a total of $O(n \log n)$ bits, we can list all substrings of
length m containing exactly c copies of 1 using $O(m)$ time per substring. In the full version of this
paper we will improve our construction time slightly by combining our approach with Moosa and
Rahman's; we will also show that our construction still takes $O(nH_0(B)) + o(n)$ bits of workspace.
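For the incompressible case, the bound follows by substituting $g = O(n / \log n)$ into the general bound:

\[
g^{2/3} n^{4/3} = O\!\left( \left( \frac{n}{\log n} \right)^{2/3} n^{4/3} \right) = O\!\left( \frac{n^2}{\log^{2/3} n} \right).
\]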

2 Construction 

Suppose we have an SLP S for B with g rules. Gawrychowski [8] recently showed how, for any $\ell$,
in $O(g + n/\ell)$ time we can turn S into an SLP S' with $O(g)$ rules such that all new nonterminals
expand into strings of length at most $\ell$ and B can be written as the concatenation of $O(n/\ell)$ new
nonterminals' expansions. It follows that, setting $\ell = (n/g)^{2/3}$, in $O(g + g^{2/3} n^{1/3}) = O(n)$ time we
can split B into $O(g^{2/3} n^{1/3})$ blocks $B_1, \ldots, B_b$ of length at most $(n/g)^{2/3}$ such that there are $O(g)$
distinct blocks $\mathcal{B}_1, \ldots, \mathcal{B}_d$.

Lemma 1. In $O(n)$ time we can split B into $O(g^{2/3} n^{1/3})$ blocks of length at most $(n/g)^{2/3}$ such
that there are $O(g)$ distinct blocks.
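The arithmetic behind Lemma 1 can be spelled out as follows, writing $\ell$ for the block length and assuming $g \le n$:

\[
\ell = \left( \frac{n}{g} \right)^{2/3}
\quad\Longrightarrow\quad
\frac{n}{\ell} = n \left( \frac{g}{n} \right)^{2/3} = g^{2/3} n^{1/3},
\qquad
g + \frac{n}{\ell} = g + g^{2/3} n^{1/3} \le 2n = O(n).
\]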

For each distinct block $\mathcal{B}_i$, $1 \le i \le d$, we build a table $\mathcal{T}_i$ saying, for each length m at most the
length of $\mathcal{B}_i$, how many and how few 1s there can be in a substring of $\mathcal{B}_i$ with length m. This takes
a total of $O(n^{4/3} / g^{1/3})$ time. For each possible pair of distinct blocks $\mathcal{B}_i$ and $\mathcal{B}_j$, $1 \le i, j \le d$, we
build a table $\mathcal{T}_{i,j}$ saying, for each length m at most the length of their concatenation $\mathcal{B}_i \mathcal{B}_j$, how
many and how few 1s there can be in a substring of $\mathcal{B}_i \mathcal{B}_j$ with length m that starts in $\mathcal{B}_i$ and ends
in $\mathcal{B}_j$. This takes a total of $O(g^{2/3} n^{4/3})$ time.
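Both totals can be checked directly, assuming each table is built in time quadratic in the length of the string it describes, and writing $d = O(g)$ for the number of distinct blocks and $\ell = (n/g)^{2/3}$ for the maximum block length:

\[
d \cdot O(\ell^2) = O\!\left( g \left( \frac{n}{g} \right)^{4/3} \right) = O\!\left( \frac{n^{4/3}}{g^{1/3}} \right),
\qquad
d^2 \cdot O\!\left( (2\ell)^2 \right) = O\!\left( g^2 \left( \frac{n}{g} \right)^{4/3} \right) = O\!\left( g^{2/3} n^{4/3} \right).
\]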

For $1 \le i \le j \le b$, we build a table $T_{i,j}$ saying, for each length m at most the length of the
concatenation $B_i \cdots B_j$, how many and how few 1s there can be in a substring of $B_i \cdots B_j$ with
length m that starts in $B_i$ and ends in $B_j$. If $j = i$ then $T_{i,j}$ is the table for the distinct block of
which $B_i$ is a copy. If $j = i + 1$ then $T_{i,j}$ is the table for the pair of distinct blocks of which $B_i$ and
$B_j$ are copies. Only when $j > i + 1$ do we need to build a new table.

Let $m_{i+1,j-1}$ be the length of $B_{i+1} \cdots B_{j-1}$ and let $c_{i+1,j-1}$ be the number of 1s in $B_{i+1} \cdots B_{j-1}$.
Let B' be a substring of $B_i \cdots B_j$ that starts in $B_i$ and ends in $B_j$, and let B'' be the substring of $B_i B_j$ obtained by removing
$B_{i+1} \cdots B_{j-1}$ from B'. Notice that the length of B' is the length of B'' plus $m_{i+1,j-1}$, and the
number of 1s in B' is the number of 1s in B'' plus $c_{i+1,j-1}$. It follows that we can build the table
$T_{i,j}$ by copying the table for the pair of distinct blocks of which $B_i$ and $B_j$ are copies, adding
$m_{i+1,j-1}$ to all the lengths and adding $c_{i+1,j-1}$ to all counts of 1s. Building all such tables takes a
total of $O(g^{2/3} n^{4/3})$ time. Merging them to form T also takes $O(g^{2/3} n^{4/3})$ time.
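The following sketch illustrates the shift-and-merge step on a toy scale. It rests on simplifying assumptions that are ours: B is cut into fixed-length blocks rather than split with Gawrychowski's SLP transformation, tables are built by brute force, and pair tables are built lazily only for pairs that actually occur. The function names are hypothetical.

```python
from itertools import accumulate

def naive_table(bits):
    """{m: (min 1s, max 1s)} over all substrings of bits with length m."""
    pre = [0] + list(accumulate(bits))
    table = {}
    for m in range(1, len(bits) + 1):
        counts = [pre[s + m] - pre[s] for s in range(len(bits) - m + 1)]
        table[m] = (min(counts), max(counts))
    return table

def pair_table(x, y):
    """Same, but only for substrings of x + y that start in x and end in y."""
    suf = [0] + list(accumulate(reversed(x)))  # suf[a] = 1s in last a bits of x
    pre = [0] + list(accumulate(y))            # pre[b] = 1s in first b bits of y
    table = {}
    for a in range(1, len(x) + 1):
        for b in range(1, len(y) + 1):
            c = suf[a] + pre[b]
            lo, hi = table.get(a + b, (c, c))
            table[a + b] = (min(lo, c), max(hi, c))
    return table

def blocked_table(bits, ell):
    """Build the table for bits from tables of distinct blocks and pairs."""
    blocks = [tuple(bits[s:s + ell]) for s in range(0, len(bits), ell)]
    single = {blk: naive_table(blk) for blk in set(blocks)}  # per distinct block
    pairs = {}                                               # per distinct pair
    length = [0] + list(accumulate(len(blk) for blk in blocks))
    ones = [0] + list(accumulate(sum(blk) for blk in blocks))
    merged = {}

    def merge(table, dm=0, dc=0):
        # Shift every length by dm and every count by dc, then merge.
        for m, (lo, hi) in table.items():
            old_lo, old_hi = merged.get(m + dm, (lo + dc, hi + dc))
            merged[m + dm] = (min(old_lo, lo + dc), max(old_hi, hi + dc))

    for i in range(len(blocks)):
        merge(single[blocks[i]])
        for j in range(i + 1, len(blocks)):
            key = (blocks[i], blocks[j])
            if key not in pairs:
                pairs[key] = pair_table(*key)
            # Blocks i+1..j-1 lie wholly inside every such substring.
            merge(pairs[key], length[j] - length[i + 1], ones[j] - ones[i + 1])
    return merged
```

On small inputs the result can be compared against naive_table applied to the whole string, for any block length.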

Lemma 2. We can build T in $O(g^{2/3} n^{4/3})$ time.

This lemma immediately implies the following theorem. In the full version of this paper, we 
will improve these results slightly by using Moosa and Rahman's result to build the tables for the 
blocks and pairs of blocks. 

Theorem 1. Given an SLP with g rules for B[1..n], in $O(g^{2/3} n^{4/3})$ time we can build a $(2nH_0(B) + o(n))$-bit
index such that, given m and c, in $O(1)$ time we can determine whether there is a substring
of B with length m containing exactly c copies of 1.

3 Listing 

If we are willing to increase our space usage by a logarithmic factor, then it is not difficult to 
turn indexes for detecting jumbled pattern matches into indexes for locating them. For the sake 
of simplicity, assume n is a power of 2. For $1 \le i \le \log n - 1$ and $0 \le j \le 2^i - 2$, we build
an index for detecting jumbled pattern matches in $B[jn/2^i + 1..(j + 2)n/2^i]$. That is, we build
indexes for B[1..n], B[1..n/2], B[n/4 + 1..3n/4], B[n/2 + 1..n], B[1..n/4], B[n/8 + 1..3n/8], B[n/4 + 1..n/2],
B[3n/8 + 1..5n/8], B[n/2 + 1..3n/4], .... These indexes take a total of $O(n \log n)$ bits.

We can visualize this as $\log n - 1$ sets of overlapping segments: the first set consists of the
segment B[1..n], i.e., the whole string; for $2 \le i \le \log n - 1$, the ith set consists of two layers of disjoint
segments of length $n/2^{i-1}$, with the first layer starting at position 1 and the second layer starting
at position $n/2^i + 1$. (We use "segment" instead of "block" to avoid confusion with Section 2.)
Notice that, for any substring of length m and any set whose segments have length at least 2m,
that substring is completely included in at least 1 segment and at most 2 segments in that set.
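To make the layout concrete, the sketch below enumerates the segments for a given n (assumed to be a power of 2 with $n \ge 4$) and finds the segments of one set that contain a given substring. The function names are ours.

```python
def segment_sets(n):
    """Return {i: [(start, end), ...]} with 1-based inclusive endpoints,
    following B[jn/2^i + 1 .. (j + 2)n/2^i] for 0 <= j <= 2^i - 2."""
    log_n = n.bit_length() - 1
    return {
        i: [(j * n // 2**i + 1, (j + 2) * n // 2**i) for j in range(2**i - 1)]
        for i in range(1, log_n)  # i = 1, ..., log n - 1
    }

def containing(segments, start, m):
    """Segments that completely include B[start .. start + m - 1]."""
    return [(s, e) for (s, e) in segments if s <= start and start + m - 1 <= e]
```

With n = 16, the second set is [(1, 8), (5, 12), (9, 16)], and the substring B[7..8] lies in both (1, 8) and (5, 12).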

Given m and c, we first query the index for the whole string B[1..n]. For $2 \le i \le \lfloor \log m \rfloor - 1$,
if we found a segment B' in the (i - 1)st set that contains a match, then we query the indexes for
the segments in the ith set that are completely contained in B'. Let occ be the number of jumbled
pattern matches for m and c. We make $O(\mathrm{occ} \log n)$ queries to indexes for segments, which takes
$O(\mathrm{occ} \log n)$ time, and find $O(\mathrm{occ})$ segments of size at most 2m that together completely include all
the jumbled pattern matches for m and c. We then scan those segments and find all the matches,
which takes $O(\mathrm{occ}(m + \log n))$ time.
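The descent can be sketched as below. This is an illustration under our own assumptions: each segment's detection index is a brute-force table of minimum and maximum counts, n is a power of 2 with $n \ge 4$, and the descent stops at the last set whose segments have length at least 2m (a stopping rule we chose so that every match stays inside some segment). Names are hypothetical.

```python
from itertools import accumulate

def build_indexes(bits):
    """Return (pre, sets); sets[i] lists (start, end, lo, hi) per segment,
    with 1-based inclusive endpoints and lo[m], hi[m] the min and max
    number of 1s in a substring of that segment with length m."""
    n = len(bits)
    pre = [0] + list(accumulate(bits))
    sets = {}
    for i in range(1, n.bit_length() - 1):
        sets[i] = []
        for j in range(2**i - 1):
            s, e = j * n // 2**i + 1, (j + 2) * n // 2**i
            lo, hi = [0], [0]
            for m in range(1, e - s + 2):
                counts = [pre[t + m - 1] - pre[t - 1] for t in range(s, e - m + 2)]
                lo.append(min(counts))
                hi.append(max(counts))
            sets[i].append((s, e, lo, hi))
    return pre, sets

def list_matches(pre, sets, m, c):
    """Sorted 1-based starts of substrings of length m with exactly c 1s."""
    def hit(seg):
        s, e, lo, hi = seg
        return m <= e - s + 1 and lo[m] <= c <= hi[m]

    found = [seg for seg in sets[1] if hit(seg)]  # the whole string
    for i in range(2, len(sets) + 1):
        if sets[i][0][1] - sets[i][0][0] + 1 < 2 * m:
            break  # segments too short to be sure to contain every match
        found = [seg for seg in sets[i]
                 if hit(seg) and any(p[0] <= seg[0] and seg[1] <= p[1] for p in found)]
    starts = set()
    for s, e, _, _ in found:  # scan only the surviving segments
        for t in range(s, e - m + 2):
            if pre[t + m - 1] - pre[t - 1] == c:
                starts.add(t)
    return sorted(starts)
```

A set is used for the result because a match may lie in two overlapping segments of the same set.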

We can reduce our time bound to $O(\mathrm{occ} \cdot m)$ by using a different strategy when $m \le \log n$. For
$1 \le m \le \log n$ and $0 \le c \le m$, we store a bitvector with 1s indicating which segments in the
($\lfloor \log m \rfloor - 1$)st set contain a substring of length m with c 1s. Each bitvector takes $O(n/m)$ space,
so we use $O(n \log n)$ bits in total. Given $m \le \log n$ and c, we use $O(1)$-time select queries on the
appropriate bitvector to find the segments to scan.

Straightforward calculation shows that, if we index each segment in time quadratic in its length,
then we use $O(n^2)$ time overall. To index all the segments in $O(g^{2/3} n^{4/3})$ time, for each segment
longer than g, we build an SLP for that segment with $O(n')$ rules, where n' is the length of the
segment. If $n' > g$, then we do this by restricting the original SLP to generate only the segment,
which introduces $O(g)$ new nonterminals. If $n' \le g$, then we build the new SLP from scratch. Once
we have the new SLP, by Theorem 1 we can build the table for it in $O(g^{2/3} (n')^{4/3})$ time. Summing
up over all of the segments, we use a total of $O(g^{2/3} n^{4/3})$ time.
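Both sums are geometric. The ith set contains $2^i - 1$ segments of length $n/2^{i-1}$, so, assuming a segment of length n' is indexed in $O(g^{2/3} (n')^{4/3})$ time,

\[
\sum_{i=1}^{\log n - 1} (2^i - 1) \cdot O\!\left( g^{2/3} \left( \frac{n}{2^{i-1}} \right)^{4/3} \right)
= O\!\left( g^{2/3} n^{4/3} \sum_{i \ge 1} 2^{-i/3} \right)
= O\!\left( g^{2/3} n^{4/3} \right),
\]

whereas with quadratic-time indexing the total is

\[
\sum_{i=1}^{\log n - 1} (2^i - 1) \cdot O\!\left( \left( \frac{n}{2^{i-1}} \right)^{2} \right)
= O\!\left( n^2 \sum_{i \ge 1} 2^{-i} \right)
= O(n^2).
\]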

Theorem 2. Given an SLP with g rules for B[1..n], in $O(g^{2/3} n^{4/3})$ time we can build an $O(n \log n)$-bit
index such that, given m and c, we can list all substrings of length m containing exactly c copies
of 1 using $O(m)$ time per substring.

4 Future Work 

As noted before, in the full version of this paper we will improve Lemma 2 and Theorem 1 (and,
thus, Theorem 2) slightly by using Moosa and Rahman's result to build the tables for the blocks and
pairs of blocks. We will also show that our construction takes $O(nH_0(B)) + o(n)$ bits of workspace.
Finally, we will also consider some new problems. For example, suppose we have a tree whose nodes
are labelled with characters from a constant-sized alphabet. If we build a perfect hash table for
the Parikh vectors of all the subtrees, then we use $O(n)$ space (measured in words) and later we
can list the subtrees with a given Parikh vector using $O(1)$ time per subtree. This approach can be
applied to jumbled pattern matching in strings, as well, but in general the space usage increases to
$O(n^2)$ (or to $O(n^{2d})$ for d-dimensional arrays).
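For strings, the hashing approach can be sketched as follows. The sketch uses an ordinary Python dictionary rather than a perfect hash table, and the function name parikh_index is ours; it is meant only to show why the space grows to quadratic, since every substring is stored.

```python
from collections import defaultdict

def parikh_index(text, alphabet):
    """Map each Parikh vector (a tuple of per-character counts, in the
    order given by alphabet) to the list of substrings having it, as
    half-open 0-based (start, end) pairs. Stores all O(n^2) substrings."""
    index = defaultdict(list)
    for start in range(len(text)):
        counts = [0] * len(alphabet)
        for end in range(start, len(text)):
            counts[alphabet.index(text[end])] += 1  # extend by one character
            index[tuple(counts)].append((start, end + 1))
    return index
```

For the binary string 0110, the vector (1, 1) maps to the two substrings 01 and 10, each then reported in constant time.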

Acknowledgments 

Many thanks to Ferdinando Cicalese, Emanuele Giaquinta, Szymon Grabowski, Kalle Karhu, 
Zsuzsanna Liptak, Simon Puglisi and Jorma Tarhio, for helpful comments. 



References 

1. G. Badkobeh, G. Fici, S. Kroon, and Z. Liptak. Binary jumbled string matching: Faster indexing in less space. 
Technical Report 1206.2523, arxiv.org, 2012. 

2. P. Burcsi, F. Cicalese, G. Fici, and Z. Liptak. On table arrangement, scrabble freaks, and jumbled pattern 
matching. In Proceedings of the Symposium on Fun with Algorithms, pages 89-101, 2010. 

3. P. Burcsi, F. Cicalese, G. Fici, and Z. Liptak. Algorithms for jumbled pattern matching in strings. International 
Journal of Foundations of Computer Science, 23(2):357-374, 2012. 

4. P. Burcsi, F. Cicalese, G. Fici, and Z. Liptak. On approximate jumbled pattern matching in strings. Theory of 
Computing Systems, 50(1):35-51, 2012. 

5. F. Cicalese, G. Fici, and Z. Liptak. Searching for jumbled patterns in strings. In Proceedings of the Prague 
Stringology Conference, pages 105-117, 2009. 

6. F. Cicalese, E. S. Laber, O. Weimann, and R. Yuster. Near linear time construction of an approximate index for 
all maximum consecutive sub-sums of a sequence. In Proceedings of the Symposium on Combinatorial Pattern 
Matching, pages 149-158, 2012. 

7. G. Fici and Z. Liptak. On prefix normal words. In Proceedings of the 15th Conference on Developments in
Language Theory, pages 228-238, 2011.

8. P. Gawrychowski. Faster algorithm for computing the edit distance between SLP-compressed strings. In
Proceedings of the Symposium on String Processing and Information Retrieval, pages 229-236, 2012.

9. E. Giaquinta and S. Grabowski. New algorithms for binary jumbled pattern matching. Technical Report 
1210.6176, arxiv.org, 2012. 

10. T. M. Moosa and M. S. Rahman. Indexing permutations for binary strings. Information Processing Letters, 
110(18-19):795-798, 2010. 

11. T. M. Moosa and M. S. Rahman. Sub-quadratic time and linear space data structures for permutation matching 
in binary strings. Journal of Discrete Algorithms, 10:5-9, 2012. 

12. W. Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression.
Theoretical Computer Science, 302(1-3):211-222, 2003.






